Name: Vinit Nalawade
In [1]:
#import required libraries
import pandas as pd
import numpy as np
#for counter operations
from collections import Counter
#for plotting graphs
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
%matplotlib inline
Part One: Go to the Social Security Administration US births website and select the births table there and copy it to your clipboard. Use the pandas read_clipboard function to read the table into Python, and use matplotlib to plot male and female births for the years covered in the data.
In [2]:
#informing python that ',' indicates thousands
df = pd.read_clipboard(thousands = ',')
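Because clipboard contents cannot be reproduced here, the effect of `thousands=','` can be sketched with `pd.read_csv` on an in-memory string; `read_clipboard` hands the clipboard text to the same delimited-text parser. The rows below are made up for illustration (they are not SSA figures), and the tab separator is an assumption about how the table is copied.

```python
import io
import pandas as pd

# Hypothetical sample rows (NOT real SSA figures), tab-separated
sample = (
    "Year of birth\tMale\tFemale\n"
    "2001\t1,234,567\t1,111,111\n"
    "2002\t1,345,678\t1,222,222\n"
)

# thousands=',' tells the parser to treat ',' as a digit-group separator
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", thousands=",")

# Without the option, the count columns are kept as text such as "1,234,567"
raw_df = pd.read_csv(io.StringIO(sample), sep="\t")
```

With the option set, the `Male` and `Female` columns are numeric and can be passed straight to `plt.plot`.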
In [3]:
df
Out[3]:
In [4]:
#plot male and female births for the years covered in the data
plt.plot(df['Year of birth'], df['Male'], c = 'b', label = 'Male')
plt.plot(df['Year of birth'], df['Female'],c = 'r', label = 'Female')
plt.legend(loc = 'upper left')
#plt.axis([1880, 2015, 0, 2500000])
plt.xlabel('Year of birth')
plt.ylabel('No. of births')
plt.title('Total births by Sex and Year')
#double the size of plot for visibility
size = 2
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*size, plSize[1]*size))
plt.show()
#plt.plot(df['Year of birth'], df['Male'], c = 'b', label = 'Male')
#plt.plot(df['Year of birth'], df['Female'],c = 'r', label = 'Female')
#plt.legend(loc = 'upper left')
#plt.xlim(xmax = 2015)
#plt.xlabel('Year of birth')
#plt.ylabel('No. of births')
#plt.title('Male and Female births from 1880 to 2015')
#plt.show()
In the same notebook, use Python to get a list of male and female names from these files. This data is broken down by year of birth.
The files contain name data for the years 1881 to 2010.
This data is aggregated into the "names" dataframe below.
In [5]:
years = range(1881,2011)
pieces = []
columns = ['name','sex','births']
for year in years:
    path = 'names/yob{0:d}.txt'.format(year)
    frame = pd.read_csv(path,names=columns)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
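The read-tag-concatenate pattern above can be tried without the `names/` directory. This is a minimal sketch that swaps the yearly files for in-memory strings; `fake_files` and its contents are hypothetical stand-ins, not real data.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly files (made-up counts)
fake_files = {
    1881: "Mary,F,10\nJohn,M,9\n",
    1882: "Anna,F,8\nJohn,M,7\n",
}

columns = ['name', 'sex', 'births']
pieces = []
for year, text in sorted(fake_files.items()):
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame['year'] = year  # tag every row with its source year
    pieces.append(frame)

# ignore_index=True discards the per-file row numbers and renumbers 0..n-1
demo_names = pd.concat(pieces, ignore_index=True)
```

Without `ignore_index=True`, each file's rows would keep their own 0-based index, so index labels would repeat across years.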
In [6]:
names.head()
Out[6]:
In [7]:
names.tail()
Out[7]:
Part Two: Aggregate the data for all years (see the examples in the Pandas notebooks). Use Python Counters to get letter frequencies for male and female names. Use matplotlib to draw a plot that for each letter (x-axis) shows the frequency of that letter (y-axis) as the last letter for both male and female names.
The data is already aggregated in the "names" dataframe.
Getting separate dataframes for males and females.
Defining a list of male names and a list of female names.
In [8]:
female_names = names[names.sex == 'F']
male_names = names[names.sex == 'M']
print "For Female names"
print female_names.head()
print "\nFor Male names"
print male_names.tail()
female_list = list(female_names['name'])
male_list = list(male_names['name'])
Calculating the letter frequency for male names.
In [9]:
male_letter_freq = Counter()
#converting every letter to lowercase
for name in map(lambda x:x.lower(),male_names['name']):
    for i in name:
        male_letter_freq[i] += 1
male_letter_freq
Out[9]:
Calculating the letter frequency for female names.
In [10]:
female_letter_freq = Counter()
#converting every letter to lowercase
for name in map(lambda x:x.lower(),female_names['name']):
    for i in name:
        female_letter_freq[i] += 1
female_letter_freq
Out[10]:
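The nested loops in the two cells above can also be written with `Counter.update`, which adds one count per character when given a string. The helper name `letter_frequencies` is hypothetical; this is a sketch of an equivalent form, not a replacement for the cells above.

```python
from collections import Counter

def letter_frequencies(names):
    """Count every letter across an iterable of names, case-insensitively."""
    freq = Counter()
    for name in names:
        # update() with a string adds one count per character
        freq.update(name.lower())
    return freq

demo = letter_frequencies(["Anna", "Ann"])
```

Under this sketch, `letter_frequencies(male_names['name'])` would play the role of `male_letter_freq`.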
Calculating the last letter frequency for male names.
In [11]:
male_last_letter_freq = Counter()
for name in male_names['name']:
    male_last_letter_freq[name[-1]] += 1
male_last_letter_freq
Out[11]:
Calculating the last letter frequency for female names.
In [12]:
female_last_letter_freq = Counter()
for name in female_names['name']:
    female_last_letter_freq[name[-1]] += 1
female_last_letter_freq
Out[12]:
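The counters above add one count per row of the dataframe, so a name contributes once for each year it appears, regardless of how many babies received it. If frequencies weighted by the `births` column were wanted instead, one possible sketch uses `groupby`. This is a variant for comparison, not the method used in this notebook, and the toy rows are made up.

```python
import pandas as pd

# Toy rows in the same layout as the "names" dataframe (made-up counts)
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'Emma', 'Mary'],
    'sex':    ['F', 'F', 'F', 'F'],
    'births': [100, 50, 30, 80],
    'year':   [1881, 1881, 1881, 1882],
})

last_letter = toy['name'].str[-1]

# Unweighted: one count per row, matching the Counter loop above
row_counts = last_letter.value_counts()

# Weighted: sum the births column for each last letter
birth_counts = toy.groupby(last_letter)['births'].sum()
```

The two views can differ noticeably: here 'a' and 'y' tie on row counts, while 'y' dominates once births are summed.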
Plot for each letter showing the frequency of that letter as the last letter for both male and female names.
I use OrderedDict from collections here to arrange the letters in each counter in ascending order for plotting.
In [13]:
#for ordering items of counter in ascending order
from collections import OrderedDict
#plot of last letter frequency of male names in ascending order of letters
male_last_letter_freq_asc = OrderedDict(sorted(male_last_letter_freq.items()))
plt.bar(range(len(male_last_letter_freq_asc)), male_last_letter_freq_asc.values(), align='center')
plt.xticks(range(len(male_last_letter_freq_asc)), male_last_letter_freq_asc.keys())
plt.xlabel('Letters')
plt.ylabel('Frequency')
plt.title('Frequency of last letter for Male names')
plt.show()
#plot of last letter frequency of female names in ascending order of letters
female_last_letter_freq_asc = OrderedDict(sorted(female_last_letter_freq.items()))
plt.bar(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.values(), align='center')
plt.xticks(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.keys())
plt.xlabel('Letters')
plt.ylabel('Frequency')
plt.title('Frequency of last letter for Female names')
plt.show()
female_last_letter_freq_asc = OrderedDict(sorted(female_last_letter_freq.items()))
plt.plot(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.values(), c = 'r', label = 'Female')
plt.plot(range(len(male_last_letter_freq_asc)), male_last_letter_freq_asc.values(), c = 'b', label = 'Male')
plt.xticks(range(len(male_last_letter_freq_asc)), male_last_letter_freq_asc.keys())
plt.xlabel('Letters')
plt.ylabel('Frequency')
plt.legend(loc = 'upper right')
plt.title('Frequency of last letter in names by Sex')
#double the size of plot for visibility
size = 2
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*size, plSize[1]*size))
plt.show()
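The combined line plot above labels both series with the male letters, which is only correct if the male and female counters contain exactly the same set of last letters. A minimal sketch of aligning any number of counters on the union of their letters follows; `align_counters` and the demo counts are hypothetical.

```python
from collections import Counter

def align_counters(*counters):
    """Return (letters, rows) where each row lists counts in the same letter order."""
    letters = sorted(set().union(*counters))
    # Counter returns 0 for missing keys, so absent letters become zeros
    rows = [[c[letter] for letter in letters] for c in counters]
    return letters, rows

# Hypothetical counts for illustration only
male_demo = Counter({'n': 5, 's': 3})
female_demo = Counter({'a': 7, 'n': 2})
letters, (male_row, female_row) = align_counters(male_demo, female_demo)
```

Both rows can then be plotted against `range(len(letters))` with `letters` as the tick labels, so each x position refers to the same letter in both series.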
Part Three: Now do just female names, but aggregate your data in decades (10 year) increments. Produce a plot that contains the 1880s line, the 1940s line, and the 1990s line, as well as the female line for all years aggregated together from Part Two. Evaluate how stable this statistic is. Speculate on why it is stable, if it is, or on what demographic facts might explain any changes, if there are any. Turn in your ipython notebook file, showing the code you used to complete parts One, Two, and Three.
In [18]:
#to get the decade lists
#female_1880 = female_names[female_names['year'] < 1890]
#female_1890 = female_names[(female_names['year'] >= 1890) & (female_names['year'] < 1900)]
#female_1900 = female_names[(female_names['year'] >= 1900) & (female_names['year'] < 1910)]
#female_1910 = female_names[(female_names['year'] >= 1910) & (female_names['year'] < 1920)]
#female_1920 = female_names[(female_names['year'] >= 1920) & (female_names['year'] < 1930)]
#female_1930 = female_names[(female_names['year'] >= 1930) & (female_names['year'] < 1940)]
#female_1940 = female_names[(female_names['year'] >= 1940) & (female_names['year'] < 1950)]
#female_1950 = female_names[(female_names['year'] >= 1950) & (female_names['year'] < 1960)]
#female_1960 = female_names[(female_names['year'] >= 1960) & (female_names['year'] < 1970)]
#female_1970 = female_names[(female_names['year'] >= 1970) & (female_names['year'] < 1980)]
#female_1980 = female_names[(female_names['year'] >= 1980) & (female_names['year'] < 1990)]
#female_1990 = female_names[(female_names['year'] >= 1990) & (female_names['year'] < 2000)]
#female_2000 = female_names[(female_names['year'] >= 2000) & (female_names['year'] < 2010)]
#female_2010 = female_names[female_names['year'] >= 2010]
#another, easier way to get the decade lists for females
female_1880 = female_names[female_names.year.isin(range(1880,1890))]
female_1890 = female_names[female_names.year.isin(range(1890,1900))]
female_1900 = female_names[female_names.year.isin(range(1900,1910))]
female_1910 = female_names[female_names.year.isin(range(1910,1920))]
female_1920 = female_names[female_names.year.isin(range(1920,1930))]
female_1930 = female_names[female_names.year.isin(range(1930,1940))]
female_1940 = female_names[female_names.year.isin(range(1940,1950))]
female_1950 = female_names[female_names.year.isin(range(1950,1960))]
female_1960 = female_names[female_names.year.isin(range(1960,1970))]
female_1970 = female_names[female_names.year.isin(range(1970,1980))]
female_1980 = female_names[female_names.year.isin(range(1980,1990))]
female_1990 = female_names[female_names.year.isin(range(1990,2000))]
female_2000 = female_names[female_names.year.isin(range(2000,2010))]
female_2010 = female_names[female_names.year.isin(range(2010,2011))] #just the year 2010 present
#to verify sorting of data
print female_1880.head()
print female_1880.tail()
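The fourteen slices above can also be derived from the year with integer division, which avoids writing one line per decade. This is a sketch on toy rows; the `decade` column and the `by_decade` dictionary are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for female_names (made-up values)
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'Linda', 'Emily'],
    'year': [1881, 1889, 1947, 1993],
})

# 1881 -> 1880, 1947 -> 1940, 1993 -> 1990
toy['decade'] = (toy['year'] // 10) * 10

# One sub-dataframe per decade, keyed by the decade's first year
by_decade = {decade: group for decade, group in toy.groupby('decade')}
```

Here `by_decade[1880]` corresponds to what the cell above calls `female_1880`.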
Preparing data for the 1880s.
A counter for last letter frequencies.
In [19]:
female_1880_freq = Counter()
for name in female_1880['name']:
    female_1880_freq[name[-1]] += 1
female_1880_freq
Out[19]:
Preparing data for the 1940s.
A counter for last letter frequencies.
In [20]:
female_1940_freq = Counter()
for name in female_1940['name']:
    female_1940_freq[name[-1]] += 1
female_1940_freq
Out[20]:
Preparing data for the 1990s.
A counter for last letter frequencies.
In [21]:
female_1990_freq = Counter()
for name in female_1990['name']:
    female_1990_freq[name[-1]] += 1
female_1990_freq
Out[21]:
Converting the frequency data from counter to dataframes after sorting the letters alphabetically.
In [22]:
#for 1880s
first = pd.DataFrame.from_dict((OrderedDict(sorted(female_1880_freq.items()))), orient = 'index').reset_index()
first.columns = ['letter','frequency']
first['decade'] = '1880s'
print first.head()
#for 1940s
second = pd.DataFrame.from_dict((OrderedDict(sorted(female_1940_freq.items()))), orient = 'index').reset_index()
second.columns = ['letter','frequency']
second['decade'] = '1940s'
print second.head()
#for 1990s
third = pd.DataFrame.from_dict((OrderedDict(sorted(female_1990_freq.items()))), orient = 'index').reset_index()
third.columns = ['letter','frequency']
third['decade'] = '1990s'
print third.head()
Aggregating all required decades (1880s, 1940s, 1990s) into a single dataframe and then into a pivot table for ease in plotting graphs.
In [23]:
#Aggregate 1880s, 1940s and 1990s frequencies
frames = [first, second, third]
columns = ["letter","frequency", "decade"]
req_decades = pd.DataFrame(pd.concat(frames))
req_decades.columns = columns
print req_decades.head()
print req_decades.tail()
#Get data into a pivot table for ease in plotting
decades_table = pd.pivot_table(req_decades, index=['letter'], values=['frequency'], columns=['decade'])
decades_table.head()
Out[23]:
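As an alternative to going from Counter to dataframe to pivot table, `pd.crosstab` can count last letters per decade in one step. This is a sketch on made-up rows, not the route taken above; note that it fills absent letter and decade combinations with 0, whereas the pivot table above leaves them as NaN.

```python
import pandas as pd

# Toy rows standing in for female_names (made-up values)
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'Emma', 'Linda', 'Carol', 'Emily'],
    'year': [1881, 1885, 1942, 1947, 1949, 1993],
})

last_letter = toy['name'].str[-1].rename('letter')
# Build labels such as '1880s' from the year
decade = (((toy['year'] // 10) * 10).astype(str) + 's').rename('decade')

# Rows are letters, columns are decades, cells are row counts (0 where absent)
letter_by_decade = pd.crosstab(last_letter, decade)
```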
Plot of last-letter frequencies of female names for the 1880s, 1940s, 1990s, and for all years (from Part Two).
In [24]:
#plot the decades as bars and the female line for all years as a line
c = ['m','g','c']
decades_table['frequency'].plot(kind = 'bar', rot = 0,color = c, title = 'Frequency of Last letter of Female names by Female Births')
#the female line for all years taken from part 2
plt.plot(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.values(), c = 'r', label = 'All Female births')
plt.xlabel('Letters')
plt.ylabel('Frequency')
plt.legend(loc = 'best')
#double the size of plot for visibility
size = 2
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*size, plSize[1]*size))
plt.show()
The graph has extreme highs and lows.
Plotting the frequencies on a logarithmic y-axis compensates for this and makes comparison easier.
In [25]:
#plot the decades as bars and the female line for all years as a line
c = ['m','g','c']
decades_table['frequency'].plot(kind = 'bar', rot = 0, logy = True,color = c, title = 'Log(Frequency) of Last letter of Female names by Female Births')
#the female line for all years taken from part 2
plt.plot(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.values(), c = 'r', label = 'All Female births')
plt.xlabel('Letters')
plt.ylabel('Log(Frequency)')
plt.legend(loc = 'best')
#double the size of plot for visibility
size = 2
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*size, plSize[1]*size))
plt.show()
Evaluate how stable this statistic is. Speculate on why it is stable, if it is, or on what demographic facts might explain any changes, if there are any.
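We can normalize the table by the column total for each decade to compute a new table containing, for each decade, the proportion of entries ending in each letter. Note that the counters above add one count per row of the data (one name in one year) and are not weighted by the births column.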
In [26]:
decades_table.sum()
Out[26]:
In [27]:
#plot the decades as bars and the female line for all years as a line
c = ['m','g','c']
decades_table_prop = decades_table/decades_table.sum().astype(float)
decades_table_prop['frequency'].plot(kind = 'bar', rot = 0,color = c, title = 'Normalized Frequency of Last letter of Female names by Female Births')
#the female line for all years taken from part 2
#plt.plot(range(len(female_last_letter_freq_asc)), female_last_letter_freq_asc.values(), c = 'r', label = 'All Female births')
plt.xlabel('Letters')
plt.ylabel('Normalized Frequency')
plt.legend(loc = 'best')
#double the size of plot for visibility
size = 2
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0]*size, plSize[1]*size))
plt.show()
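The normalization step relies on pandas aligning the result of `sum()` with the column labels. A minimal sketch with a made-up table shows the mechanics; the counts are hypothetical.

```python
import pandas as pd

# Made-up counts: rows are last letters, columns are decades
counts = pd.DataFrame(
    {'1880s': [6, 3, 1], '1990s': [40, 40, 20]},
    index=['a', 'e', 'y'],
)

# counts.sum() is a Series indexed by column label, so the division
# aligns on columns and scales each decade to sum to 1
proportions = counts / counts.sum()
```

After this step each column is a distribution over letters, so decades with very different totals can be compared on the same axis.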
The statistic is fairly stable for some letters, but not for others.
The letters 'a' and 'e', while decreasing in popularity, are still the leading final letters. The letters 'n' and 'y' became more common in the 1940s and 1990s.
This suggests that fewer parents choose common names for their female children as we progress through the decades.
Tradition no longer seems to dictate how female children are named.
In the 1880s, the majority of female names end in 'a' or 'e'.
In the 1940s, the majority of female names end in 'a', 'e', 'i', 'l', 'n', 's' or 'y', which marks the start of a break from tradition.
In the 1990s, the majority of female names end in 'a', 'e', 'h', 'i', 'l', 'n' or 'y', which suggests that while some parents still choose traditional names, fashion, trends, or a desire for uniqueness may also influence how girls are named in recent decades.
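One possible way to put a number on "stable" (an addition for illustration, not part of the analysis above) is the total variation distance between two decades' last-letter distributions: half the sum of the absolute differences in proportions. It is 0 when the distributions are identical and 1 when they share no letters. The function name and the counts below are hypothetical.

```python
from collections import Counter

def total_variation(counter_a, counter_b):
    """Half the L1 distance between two normalized letter distributions."""
    total_a = float(sum(counter_a.values()))
    total_b = float(sum(counter_b.values()))
    letters = set(counter_a) | set(counter_b)
    return 0.5 * sum(
        abs(counter_a[l] / total_a - counter_b[l] / total_b) for l in letters
    )

# Made-up counts for illustration only
early = Counter({'a': 60, 'e': 30, 'y': 10})
late = Counter({'a': 40, 'e': 20, 'y': 20, 'n': 20})
distance = total_variation(early, late)
```

Applied to the counters built earlier, `total_variation(female_1880_freq, female_1990_freq)` would give a single figure to compare against, say, the 1880s versus the 1940s.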